This report describes the process of building a machine learning model that classifies a weight lifting exercise (the unilateral dumbbell biceps curl) into one of five classes (A to E).
More information about the research and the data used can be found on the following website: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises
TODO: MODEL RESULT
Our data source URLs:
TRAINING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TESTING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
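The loading code in this report reads from TRAINING_FILE_PATH and TESTING_FILE_PATH, which are not defined in the text. The following is a minimal sketch of one way they could be set up, assuming a local `data` directory; `DATA_DIR`, `local.path` and `download.if.missing` are hypothetical names introduced only for illustration.

```r
TRAINING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TESTING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# Assumption: downloaded files are kept in a local "data" directory
DATA_DIR <- "data"

# Hypothetical helper: local file path derived from the last URL segment
local.path <- function(url, dir = DATA_DIR) {
  file.path(dir, basename(url))
}

TRAINING_FILE_PATH <- local.path(TRAINING_SOURCE_FILE_URL)
TESTING_FILE_PATH <- local.path(TESTING_SOURCE_FILE_URL)

# Hypothetical helper: download a file only when no local copy exists yet
download.if.missing <- function(url, path) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = path, mode = "wb")
  }
  invisible(path)
}
```

Calling `download.if.missing(TRAINING_SOURCE_FILE_URL, TRAINING_FILE_PATH)` (and the same for the testing file) before `read.csv` would then fetch each file once and reuse the local copy on later runs.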
Loading and splitting the data (training, validating and testing):
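library(caret)  # provides createDataPartition

# Treat both "NA" and the spreadsheet error "#DIV/0!" as missing values
NA_STRINGS <- c("NA", "#DIV/0!")
training <- read.csv(TRAINING_FILE_PATH, na.strings = NA_STRINGS)
testing <- read.csv(TESTING_FILE_PATH, na.strings = NA_STRINGS)
# Hold out 30% of the labeled data for validation, sampled within each classe
in.training <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validating <- training[-in.training, ]
training <- training[in.training, ]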
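Because createDataPartition draws a random sample, the split changes from run to run unless a seed is fixed. The report does not show a seed, so the sketch below assumes one (the value 1234 is arbitrary) and uses a small toy data frame, not the real data, to illustrate how the class proportions of the two subsets can be compared after the split.

```r
library(caret)

set.seed(1234)  # arbitrary seed, assumed here only to make the split repeatable

# Toy stand-in for the labeled data: 200 rows with unbalanced classes
toy <- data.frame(
  x = rnorm(200),
  classe = factor(rep(c("A", "B", "C"), times = c(100, 60, 40)))
)

# Same call pattern as in the report: 70% training, 30% validation
in.training <- createDataPartition(y = toy$classe, p = 0.7, list = FALSE)
toy.training <- toy[in.training, ]
toy.validating <- toy[-in.training, ]

# The class proportions should be close to each other in both subsets
print(round(prop.table(table(toy.training$classe)), 2))
print(round(prop.table(table(toy.validating$classe)), 2))
```

If the two printed tables differ noticeably, the split was probably not sampled within each class, for example because the outcome column was passed incorrectly.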
What our training dataset looks like:
dim(training)
## [1] 13737 160
print(table(training$classe))
##
##    A    B    C    D    E
## 3906 2658 2396 2252 2525
Checking the presence of NAs per variable:
na.stats
##
## (-0.001,0.05] (0.95,1]
## 60 100
A large number of variables (100 of the 160) have more than 95% NA values. These variables will be excluded from our models.
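The code that builds na.stats is not shown in this report. The sketch below is one possible way to obtain a similar summary, shown on a toy data frame; the bin edges and the 0.95 threshold are assumptions chosen to mirror the output above, and `na.fraction` is a name introduced for illustration.

```r
# Toy data frame: column a is complete, b is entirely NA, c has one NA
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA_real_, NA_real_, NA_real_, NA_real_),
  c = c(1, NA, 3, 4)
)

# Fraction of missing values in each column
na.fraction <- colMeans(is.na(toy))

# Count the columns falling into each NA-fraction bin
na.stats <- table(cut(na.fraction,
                      breaks = c(0, 0.05, 0.95, 1),
                      include.lowest = TRUE))
print(na.stats)

# Columns with more than 95% missing values are candidates for removal
almost.empty.columns <- names(na.fraction)[na.fraction > 0.95]
print(almost.empty.columns)
```

On the real training set, the same two steps would be applied to `training` instead of `toy`.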
Outliers.
You can see more details about the training dataset in the Appendix.
Remove unnecessary columns
NZV
Outlier removal.
Normalization?
Models
Variables that are being ignored:
unwanted.columns
## [1] "X" "raw_timestamp_part_1" "raw_timestamp_part_2"
## [4] "cvtd_timestamp" "new_window" "num_window"
## [7] "user_name"
almost.empty.columns
## [1] "kurtosis_yaw_belt" "skewness_yaw_belt"
## [3] "kurtosis_yaw_dumbbell" "skewness_yaw_dumbbell"
## [5] "kurtosis_yaw_forearm" "skewness_yaw_forearm"
## [7] "kurtosis_picth_forearm" "skewness_pitch_forearm"
## [9] "kurtosis_roll_forearm" "skewness_roll_forearm"
## [11] "max_yaw_forearm" "min_yaw_forearm"
## [13] "amplitude_yaw_forearm" "kurtosis_picth_arm"
## [15] "skewness_pitch_arm" "kurtosis_roll_arm"
## [17] "skewness_roll_arm" "kurtosis_picth_belt"
## [19] "skewness_roll_belt.1" "kurtosis_yaw_arm"
## [21] "skewness_yaw_arm" "kurtosis_roll_belt"
## [23] "skewness_roll_belt" "max_yaw_belt"
## [25] "min_yaw_belt" "amplitude_yaw_belt"
## [27] "kurtosis_roll_dumbbell" "skewness_roll_dumbbell"
## [29] "max_yaw_dumbbell" "min_yaw_dumbbell"
## [31] "amplitude_yaw_dumbbell" "kurtosis_picth_dumbbell"
## [33] "skewness_pitch_dumbbell" "max_roll_belt"
## [35] "max_picth_belt" "min_roll_belt"
## [37] "min_pitch_belt" "amplitude_roll_belt"
## [39] "amplitude_pitch_belt" "var_total_accel_belt"
## [41] "avg_roll_belt" "stddev_roll_belt"
## [43] "var_roll_belt" "avg_pitch_belt"
## [45] "stddev_pitch_belt" "var_pitch_belt"
## [47] "avg_yaw_belt" "stddev_yaw_belt"
## [49] "var_yaw_belt" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "max_roll_arm"
## [61] "max_picth_arm" "max_yaw_arm"
## [63] "min_roll_arm" "min_pitch_arm"
## [65] "min_yaw_arm" "amplitude_roll_arm"
## [67] "amplitude_pitch_arm" "amplitude_yaw_arm"
## [69] "max_roll_dumbbell" "max_picth_dumbbell"
## [71] "min_roll_dumbbell" "min_pitch_dumbbell"
## [73] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [75] "var_accel_dumbbell" "avg_roll_dumbbell"
## [77] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [79] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [81] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [83] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [85] "max_roll_forearm" "max_picth_forearm"
## [87] "min_roll_forearm" "min_pitch_forearm"
## [89] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [91] "var_accel_forearm" "avg_roll_forearm"
## [93] "stddev_roll_forearm" "var_roll_forearm"
## [95] "avg_pitch_forearm" "stddev_pitch_forearm"
## [97] "var_pitch_forearm" "avg_yaw_forearm"
## [99] "stddev_yaw_forearm" "var_yaw_forearm"
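The two vectors above only list the columns. A minimal sketch of how they might be applied is given below on a toy data frame; `drop.columns` is a hypothetical helper, not a function from the report, and the toy vectors contain only a few of the names listed above.

```r
# Toy data frame with a few columns named like those in the dataset
toy <- data.frame(
  X = 1:3,
  user_name = c("adelmo", "carlitos", "pedro"),
  roll_belt = c(1.41, 1.42, 1.48),
  max_roll_belt = c(NA_real_, NA_real_, NA_real_),
  classe = factor(c("A", "B", "A"))
)

# Shortened versions of the vectors from the report, for illustration only
unwanted.columns <- c("X", "user_name")
almost.empty.columns <- c("max_roll_belt")

# Hypothetical helper: keep every column whose name is not in `columns`
drop.columns <- function(df, columns) {
  df[, !(names(df) %in% columns), drop = FALSE]
}

toy.clean <- drop.columns(toy, c(unwanted.columns, almost.empty.columns))
print(names(toy.clean))
```

Applying the same helper with the same column vectors to the training, validating and testing sets keeps their predictors consistent.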
After the removal of the variables with too many NAs, no near-zero-variance predictors remain either:
nzv <- nearZeroVar(training, saveMetrics = TRUE)
print(nzv[nzv$nzv,])
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
Boxplot for each numeric variable per classe:
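The plotting code is not included in this report. As a rough sketch under that caveat, base R graphics could produce one boxplot per numeric variable, grouped by classe, as follows; `plot.boxplots` is a hypothetical helper and the data frame is a toy example.

```r
# Hypothetical helper: one boxplot per numeric column, grouped by the outcome
plot.boxplots <- function(df, outcome = "classe") {
  is.num <- vapply(df, is.numeric, logical(1))
  numeric.columns <- names(df)[is.num]
  for (column in numeric.columns) {
    # split() yields one numeric vector per class, which boxplot() accepts
    boxplot(split(df[[column]], df[[outcome]]),
            main = column, xlab = outcome, ylab = column)
  }
  invisible(numeric.columns)
}

# Toy data frame standing in for the cleaned training set
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48, 120.0, 121.5, 119.8),
  pitch_belt = c(8.07, 8.06, 8.05, 25.1, 24.9, 25.3),
  classe = factor(c("A", "A", "A", "B", "B", "B"))
)

pdf(NULL)  # draw to a null device so the example writes no file
plotted <- plot.boxplots(toy)
invisible(dev.off())
print(plotted)
```

In an interactive session or a knitted report, the `pdf(NULL)` and `dev.off()` lines can be dropped so the plots are displayed.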